Reward Model DeBERTa V3 Large V2

License: MIT
This reward model is trained to predict which of several generated answers to a given question a human would prefer. It is suitable for QA evaluation, reward scoring in RLHF, and detecting toxic answers by ranking.
Tags: Large Language Model, Transformers, English
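As a minimal sketch of how a reward model like this is typically queried with the Transformers library: the model is loaded as a sequence classifier, each (question, answer) pair is tokenized together, and the scalar logit serves as the preference score. The model id used below is an assumption inferred from the title, not confirmed by this card.

```python
# Assumed Hub id, inferred from the model name above; replace if it differs.
MODEL_ID = "OpenAssistant/reward-model-deberta-v3-large-v2"

def score_answers(question, answers, model_id=MODEL_ID):
    """Return one scalar reward per (question, answer) pair; higher = more preferred.

    transformers/torch are imported locally so the lightweight ranking helper
    below can be used without these heavy dependencies installed.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    import torch

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    scores = []
    for answer in answers:
        # Question and answer are encoded as a single paired input.
        inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # A single-label classifier head yields one logit: the reward score.
            scores.append(model(**inputs).logits[0].item())
    return scores

def pick_preferred(scores):
    """Index of the highest-scoring (i.e. predicted-preferred) answer."""
    return max(range(len(scores)), key=scores.__getitem__)

# Usage sketch (requires downloading the model weights):
#   scores = score_answers("Explain fusion simply.", ["Atoms merge and release energy.", "No idea."])
#   best = pick_preferred(scores)
```

For RLHF, `score_answers` would be called on sampled completions to produce the reward signal; for toxic-answer detection, a low score relative to other candidates flags the answer.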